Korektor - A System for Contextual Spell-Checking and Diacritics Completion
Abstract
We present Korektor, a flexible and powerful purely statistical text correction tool for Czech that goes beyond a traditional spell checker. We use a combination of several language models and an error model to offer the best ordering of correction proposals and also to find errors that simple spell checkers cannot detect, namely spelling errors that happen to be homographs of existing word forms. Our system also works, without any adaptation, as a diacritics generator, with the best reported results for Czech text. The design of Korektor contains no language-specific parts other than trained statistical models, which makes it well suited to being trained for other languages with available resources. The evaluation demonstrates that the system is a state-of-the-art tool for Czech, both as a spell checker and as a diacritics generator. We also show that these functions combine into a potential aid in the error annotation of a learner corpus of Czech.
TITLE AND ABSTRACT IN CZECH
Korektor – systém pro kontextovou opravu pravopisu a doplnění diakritiky
Představujeme Korektor – flexibilní statistický nástroj pro opravu českých textů, jehož schopnosti přesahují tradiční nástroje pro kontrolu pravopisu. Korektor využívá kombinace jazykových modelů a chybového modelu jak k tomu, aby setřídil pořadí nabízených náhrad pro neznámé slovo podle pravděpodobnosti výskytu na daném místě v textu, tak také, aby nalezl i překlepy, které se nahodile shodují s existujícím českým slovním tvarem. Prostou náhradou chybového modelu pracuje Korektor také jako systém pro doplnění diakritiky („oháčkování textu“) s nejvyšší publikovanou úspěšností. Systém neobsahuje žádné významné jazykově specifické komponenty s výjimkou natrénovaných statistických modelů. Je tedy možné jej snadno natrénovat i pro jiné jazyky. Ukážeme, jakých zlepšení náš systém dosahuje v porovnání se stávajícími českými korektory pravopisu i systémy pro doplnění diakritiky. Ukážeme také, že kombinace těchto schopností pomáhá při anotaci chyb v korpusu češtiny jako druhého jazyka.
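As an illustration of how a language model and an error model can be combined to rank correction proposals and to flag real-word errors, here is a minimal Python sketch. It is not Korektor's actual implementation: the toy bigram counts, the add-one smoothing, the fixed error probabilities and the names `lm_logprob`, `error_logprob` and `rank` are all assumptions made for this example.

```python
import math

# Toy counts (assumed for illustration); a real system would estimate
# these from a large corpus.
BIGRAMS = {("a", "piece"): 8, ("a", "peace"): 1,
           ("piece", "of"): 9, ("peace", "of"): 2}
UNIGRAMS = {"a": 20, "piece": 10, "peace": 10, "of": 30}
VOCAB_SIZE = len(UNIGRAMS)


def lm_logprob(prev, word):
    """Add-one smoothed bigram log-probability log P(word | prev)."""
    num = BIGRAMS.get((prev, word), 0) + 1
    den = UNIGRAMS.get(prev, 0) + VOCAB_SIZE
    return math.log(num / den)


def error_logprob(typed, intended):
    """Hypothetical error model: leaving a word unchanged is likely,
    any change is penalised with one fixed probability."""
    return math.log(0.9) if typed == intended else math.log(0.1)


def rank(prev, typed, nxt, candidates):
    """Order candidates by error-model score plus left and right context scores."""
    def score(cand):
        return (error_logprob(typed, cand)
                + lm_logprob(prev, cand)
                + lm_logprob(cand, nxt))
    return sorted(candidates, key=score, reverse=True)


# "peace" is a valid word, but the context "a ... of" favours "piece".
best = rank("a", "peace", "of", ["peace", "piece"])[0]
```

Because the typed word itself is always among the candidates, a correctly spelled word that fits its context poorly can still be outranked, which is how context-aware checking can catch errors that a lexicon lookup alone would miss. Under the same assumptions, diacritics completion could reuse the ranking with a different error model whose candidates are the diacritised variants of the typed word.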
Similar resources
Improvements to Korektor: A Case Study with Native and Non-Native Czech
We present recent developments of Korektor, a statistical spell checking system. In addition to lexicon, Korektor uses language models to find real-word errors, detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted...
Spell Checking in Spanish: The Case of Diacritic Accents
This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms...
On using context for automatic correction of non-word misspellings in student essays
In this paper we present a new spell-checking system that utilizes contextual information for automatic correction of non-word misspellings. The system is evaluated with a large corpus of essays written by native and nonnative speakers of English to the writing prompts of high-stakes standardized tests (TOEFL and GRE). We also present comparative evaluations with Aspell and the speller from Mic...
Personalized Spell Checking using Neural Networks
Spell checkers are one of the most widely recognized and heavily employed features of word processing applications in existence today. This remains true despite the many problems inherent in the spell checking methods employed by all modern spell checkers. In this paper we present a proof-of-concept spell checking system that is able to intrinsically avoid many of these problems. In particular, ...
Segmentation of touching characters in printed document recognition
Abstract: A new discrimination function is presented for segmenting touching characters based on both pixel and profile projections. A dynamic recursive segmentation algorithm is developed for effectively segmenting touching characters. Contextual information and spell checking are used to correct errors caused by incorrect recognition and segmentation. Based on 12 real documents, a maximum 99....